Final Project STAT 331
Reproducibility
All code, raw data, and project files are available in our GitHub Repository. Feel free to explore or replicate our analysis!
1 Project Proposal + Data
This analysis uses the life expectancy and gross domestic product (GDP) datasets from Gapminder, a non-profit organization whose mission “is to fight devastating ignorance with a fact-based world view everyone could understand.” Their site provides datasets collected from many reputable sources, along with interactive visualizations on important world topics. We also used a countries-by-continent dataset that Daina Bouquin posted on Kaggle. It provides a continent classification for each country, allowing us to conduct continent-specific analysis in our study.
1.1 Data Cleaning
In the raw GDP dataset, some values used a “k” suffix to represent thousands of dollars (e.g., 10,000 written as 10k). Our first step was therefore to convert the GDP values into numeric form. To keep the values consistent, we created a function that expands these abbreviated values into their full numeric form, allowing for accurate numeric comparisons. Without this step, any observation containing a “k” would be lost when the column was converted to numeric, leaving an empty value that could affect later analysis.
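The conversion described above could look roughly like the following sketch. The helper name `convert_k` and the sample values are hypothetical and are not taken from our project code. It assumes the suffix is always a trailing “k” and that every other value is already a plain number stored as text.

```r
# Hypothetical helper: turn strings such as "10k" or "950" into numbers.
convert_k <- function(x) {
  x <- trimws(as.character(x))
  # Flag the entries that end in "k" (upper or lower case)
  has_k <- grepl("k$", x, ignore.case = TRUE)
  # Strip the suffix, then parse what remains as a number
  value <- as.numeric(sub("k$", "", x, ignore.case = TRUE))
  # Entries that carried a "k" are scaled to thousands
  ifelse(has_k, value * 1000, value)
}

# Made-up example values, including a missing entry
convert_k(c("950", "10k", "12.5k", NA))
```

Missing entries stay missing, so only genuinely absent values remain empty after the conversion.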
1.2 Pivoting Longer
The life expectancy data contains the life expectancy, in years, for 196 countries from 1800 to 2100. For 1800 to 1970, the data was sourced from Gapminder’s main source, v7 by Mattias Lindgren. Data for 1950-2019 came from the Global Burden of Disease Study 2019 by the IHME. For 2020-2100, Gapminder used UN forecasts from the World Population Prospects 2022.
Life Expectancy Info from: https://www.gapminder.org/data/documentation/gd004
The GDP data was obtained from the Maddison Project Database (MPD) and the Penn World Table (PWT). This dataset contains gross domestic product (GDP) per person, adjusted for differences in purchasing power and expressed in international dollars at fixed 2017 prices. GDP per capita measures the value of everything a country produces during a year, divided by the number of people. We transformed the data to have columns containing the country, year, and GDP.
GDP Info from: https://www.gapminder.org/data/documentation/gd001/
We transformed the individual year columns into a single year column so that the dataset would be easier to read. As a result, each observation consists of one country and year, with the corresponding life expectancy. The raw GDP data resembles the life expectancy data in that each year has its own column, so we transformed it in the same way, making year its own column alongside the corresponding GDP.
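As a rough illustration of this reshaping, the sketch below pivots a tiny made-up wide table into long form. The data frame `lex_wide`, its values, and the column names are hypothetical. It is written in base R so it runs without extra packages; `pivot_longer()` from tidyr performs the same kind of reshaping.

```r
# Made-up wide data: one row per country, one column per year
lex_wide <- data.frame(
  country = c("A", "B"),
  `1800` = c(30.1, 35.2),
  `1801` = c(30.3, 35.0),
  check.names = FALSE
)

# Every column except `country` is treated as a year
year_cols <- setdiff(names(lex_wide), "country")

# Stack the year columns: one row per country and year
lex_long <- data.frame(
  country = rep(lex_wide$country, times = length(year_cols)),
  year = rep(as.integer(year_cols), each = nrow(lex_wide)),
  life_expectancy = unlist(lex_wide[year_cols], use.names = FALSE)
)

lex_long
```

The GDP table would be reshaped the same way, with the value column named for GDP instead.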
1.3 Joining Datasets
After cleaning each dataset, we joined the two by our observational unit, country. We hypothesize that as GDP increases, life expectancy will also increase, since a higher GDP is associated with better infrastructure and greater access to healthcare and medicine.
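A join of this kind might be sketched as follows. The two small tables are made up, and the sketch assumes the long-format data is matched on both country and year, since each observation in the pivoted data is one country in one year. Base R's `merge()` keeps only the rows present in both tables by default.

```r
# Made-up long-format tables
lex_long <- data.frame(
  country = c("A", "A", "B"),
  year = c(1800L, 1801L, 1800L),
  life_expectancy = c(30.1, 30.3, 35.2)
)
gdp_long <- data.frame(
  country = c("A", "A", "C"),
  year = c(1800L, 1801L, 1800L),
  gdp = c(600, 610, 900)
)

# Assumed join keys: country and year (an inner join by default)
gdp_lex <- merge(lex_long, gdp_long, by = c("country", "year"))

gdp_lex
```

Countries "B" and "C" drop out here because each appears in only one of the two tables.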
2 Linear Regressions
2.1 Data Visualization
2.2 Linear Regression
Call:
lm(formula = life_expectancy ~ avg_gdp, data = gdp_lex_mean)
Residuals:
Min 1Q Median 3Q Max
-59.329 -19.294 -2.221 20.228 40.524
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.809e+01 1.306e-01 368.26 <2e-16 ***
avg_gdp 4.073e-04 7.198e-06 56.58 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.87 on 57343 degrees of freedom
Multiple R-squared: 0.05288, Adjusted R-squared: 0.05286
F-statistic: 3202 on 1 and 57343 DF, p-value: < 2.2e-16
Regression Model Estimates by Continent: Average Life Expectancy vs Average GDP

| Continent | Term | Estimate | Std. Error | t-Statistic | p-Value¹ |
|---|---|---|---|---|---|
| Asia | (Intercept) | 49.612363 | 0.967254 | 51.291992 | 0.000000 |
| Asia | avg_gdp | 0.000151 | 0.000054 | 2.826570 | 0.007052 |
| Africa | (Intercept) | 45.545427 | 0.468580 | 97.198742 | 0.000000 |
| Africa | avg_gdp | 0.000608 | 0.000076 | 7.955113 | 0.000000 |
| Europe | (Intercept) | 53.359873 | 0.980483 | 54.422023 | 0.000000 |
| Europe | avg_gdp | 0.000296 | 0.000031 | 9.637500 | 0.000000 |
| South America | (Intercept) | 55.974115 | 1.968199 | 28.439260 | 0.000000 |
| South America | avg_gdp | −0.000080 | 0.000135 | −0.594425 | 0.565433 |
| North America | (Intercept) | 48.092682 | 1.902205 | 25.282599 | 0.000000 |
| North America | avg_gdp | 0.000591 | 0.000114 | 5.162279 | 0.000041 |
| Oceania | (Intercept) | 47.159429 | 2.647934 | 17.809898 | 0.000000 |
| Oceania | avg_gdp | 0.000724 | 0.000177 | 4.099796 | 0.001473 |

¹ P-values below 0.05 indicate statistical significance.
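One way to obtain a separate model per continent is to split the data by continent and fit `lm()` on each piece. The sketch below does this on a small made-up data frame. The object `toy` and its values are hypothetical; only the variable names `life_expectancy` and `avg_gdp` come from the model call shown earlier. It is not necessarily how the estimates above were produced.

```r
# Made-up data with two placeholder continents
toy <- data.frame(
  continent = rep(c("X", "Y"), each = 4),
  avg_gdp = c(1000, 2000, 3000, 4000, 1000, 2000, 3000, 4000),
  life_expectancy = c(50, 52.5, 53.5, 56, 60, 61.5, 61.5, 63)
)

# Fit one simple linear regression per continent
fits <- lapply(
  split(toy, toy$continent),
  function(d) lm(life_expectancy ~ avg_gdp, data = d)
)

# Collect intercepts and slopes: one row per continent
coefs <- t(sapply(fits, coef))
coefs
```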
\[ \hat{y} = 49.6 + 0.000151x \]
Intercept: When average GDP is \(0\) dollars, the predicted average life expectancy in Asia is \(49.6\) years.
Slope: For each additional \(1\) dollar of average GDP, the predicted average life expectancy in Asia increases by \(0.000151\) years.
\[ \hat{y} = 45.5 + 0.000608x \]
Intercept: When average GDP is \(0\) dollars, the predicted average life expectancy in Africa is \(45.5\) years.
Slope: For each additional \(1\) dollar of average GDP, the predicted average life expectancy in Africa increases by \(0.000608\) years.
\[ \hat{y} = 53.4 + 0.000296x \]
Intercept: When average GDP is \(0\) dollars, the predicted average life expectancy in Europe is \(53.4\) years.
Slope: For each additional \(1\) dollar of average GDP, the predicted average life expectancy in Europe increases by \(0.000296\) years.
\[ \hat{y} = 56 - 0.0000803x \]
Intercept: When average GDP is \(0\) dollars, the predicted average life expectancy in South America is \(56\) years.
Slope: For each additional \(1\) dollar of average GDP, the predicted average life expectancy in South America decreases by \(0.0000803\) years, although this slope is not statistically significant (p-value of \(0.565\)).
\[ \hat{y} = 48.1 + 0.000591x \]
Intercept: When average GDP is \(0\) dollars, the predicted average life expectancy in North America is \(48.1\) years.
Slope: For each additional \(1\) dollar of average GDP, the predicted average life expectancy in North America increases by \(0.000591\) years.
\[ \hat{y} = 47.2 + 0.000724x \]
Intercept: When average GDP is \(0\) dollars, the predicted average life expectancy in Oceania is \(47.2\) years.
Slope: For each additional \(1\) dollar of average GDP, the predicted average life expectancy in Oceania increases by \(0.000724\) years.
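Because the slopes are expressed per single dollar, they look very small. A quick worked example puts them on a more readable scale. The sketch below plugs a hypothetical average GDP of 10,000 dollars into the fitted equation for Asia. The GDP value is chosen only for illustration, and the function name `predict_lex_asia` is made up.

```r
# Rounded coefficients from the fitted equation for Asia
asia_intercept <- 49.6
asia_slope <- 0.000151

# Hypothetical helper: predicted average life expectancy at a given average GDP
predict_lex_asia <- function(gdp) asia_intercept + asia_slope * gdp

# Predicted value at an illustrative average GDP of 10,000 dollars
predict_lex_asia(10000)

# Change in the prediction per additional 10,000 dollars of average GDP
asia_slope * 10000
```

Under this model, an extra 10,000 dollars of average GDP corresponds to about 1.51 more years of predicted life expectancy in Asia.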
2.3 Model Fit
| Response | Fitted Values | Residuals |
|---|---|---|
| 459.75 | 24.31 | 435.43 |
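Assuming the three numbers in this table are the variances of the response, the fitted values, and the residuals, they can be tied back to the R-squared reported in the regression summary. The sketch below recomputes the share of variance explained from the rounded table values; the variable names are made up.

```r
# Values copied from the table (assumed to be variances)
var_response <- 459.75
var_fitted <- 24.31
var_residuals <- 435.43

# Share of the response variance captured by the fitted values
r_squared <- var_fitted / var_response
round(r_squared, 4)

# The fitted and residual variances should add up to roughly the response variance
var_fitted + var_residuals
```

The ratio comes out to about 0.0529, in line with the Multiple R-squared of 0.05288 in the summary output, so average GDP accounts for only about 5% of the variation in life expectancy in this model.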